Optimize repeated pointer accesses in op/base reduction functions #9717
Conversation
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Can one of the admins verify this patch?
ok to test
I'm a little surprised that we left that pointer-dereference assignment in all those loops, but I guess it just got missed. I double checked MPI-4.0 to make sure that the len argument isn't an INOUT, and it doesn't appear to be. MPI-4.0 p235:13-16 says:
@bosilca Can you verify?
@jsquyres the Git commit checker did not run (even if you issue the …)
@gkatev thanks for the patch! Out of curiosity, did you make any performance measurements (with Open MPI compiled for production, e.g. …)?
This might be an aliasing issue: the compiler cannot assume that …
@ggouaillardet I think I am not seeing a difference with O2/O3, but am I doing it correctly like this?
@gkatev I believe you'll need to pass in CFLAGS at configure time for it to have any effect. You can check with …
FWIW, in past projects I have seen these little optimizations make a difference. Compilers should be smart enough, but we can't always rely on that.
Check the assembly on godbolt for 4 different versions of the int32_t sum. Depending on the compiler, the generated code is slightly different. clang seems to assume the pointer to count is restrict and always generates the same optimized code, while gcc takes a very cautious approach. In general, the difference is reloading *count from the cache instead of working on it from a register. Conclusion: to take the compiler out of the equation and always be sure the code is as optimal as possible, either we add restrict or we use the trick proposed here (a temporary variable).
However, I think I would rather have the code with the following pattern:

```c
void ompi_op_base_2buff_sum_int32_t(const void *in, void *out, int *count)
{
    int i;
    const int32_t *a = (const int32_t *) in;
    int32_t *b = (int32_t *) out;
    for (i = *count; i > 0; i--) {
        *(b++) += *(a++);
    }
}
```
@bosilca Just curious: why keep it incrementing through a pointer dereference?
Not sure I understand the question. Are we talking about count or about the other pointers?
@bosilca You're right... my question was non-sensical. Disregard!
I thought the point of @bosilca's comment was to walk backwards (…)
Ahh never mind, the pointers are still walking forward 🤦 |
Remove all ops with 3 buffers, we ended up not using them anywhere in the code. Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717. Fixes open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
I created #9719 as an alternative to this. While digging in the code I also noticed we are not using the 3buff MPI_Op so I took the opportunity to remove all support. |
Remove all ops with 3 buffers, we ended up not using them anywhere in the code. Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
FYI, for curiosity's sake I checked with …
@devreal you are absolutely right about pointer aliasing! And that makes the patch necessary to improve performance whenever the compiler has to be conservative to generate correct code.
optimizations, as discussed in open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu> (cherry picked from commit 1215776)
Hi, this PR changes the repeated pointer accesses in the reduction functions' range checks so that each access is performed only once. (Hope I didn't miss any; are there reduction functions in another file?)
I wouldn't normally expect this change to make a difference, since I'd expect the compiler to optimize the original code on its own, but I have experimentally observed increased performance with it. That might well not hold for a different system or compiler, but the change shouldn't really have any downside.
Allreduce performance (latency) before patch
10 runs: Average 6374.9 us, Stddev 30.2834 us, Min 6327.88 us, Max 6420.74 us
Allreduce performance (latency) after patch
10 runs: Average 5776.3 us, Stddev 41.5984 us, Min 5705.17 us, Max 5828.30 us
Setup
Disclaimer: Tested on v5.0.0rc2
Signed-off-by: George Katevenis gkatev@ics.forth.gr